May 14, 2020

Developing Data Products - Candy Data

This presentation is an assignment for week 3 of Johns Hopkins University/Coursera's Developing Data Products course. The requirements are simple:

  1. The project must contain a date that is within 2 months of the grade date.
  2. The project is a web page presentation containing an interactive plot created with Plotly.

To have some fun with this project, I chose to use FiveThirtyEight's candy ranking data from "The Ultimate Halloween Candy Power Ranking" (source on the last slide).

The Data

For a fun Halloween analysis, FiveThirtyEight asked its readers to vote in random match-ups of 83 candies and two coins (a quarter and a dime) to choose the most desirable Halloween handout. Author Walt Hickey then scoured for data on said candies (price per unit, chocolate content, nougat, nuts, etc.) to find out if there was some method to the sugar-induced madness. Most of the variables are binary, with 1 for 'Yes' and 0 for 'No', and describe one attribute of a candy. Three columns are numeric: sugar content and price, which the source data reports as percentile ranks within the data set (shown on a 0 to 1 scale), and win percentage (the outcome, on a 0 to 100 scale).

##              chocolate fruity caramel peanutyalmondy nougat crispedricewafer
## 100 Grand            1      0       1              0      0                1
## 3 Musketeers         1      0       0              0      1                0
## One dime             0      0       0              0      0                0
## One quarter          0      0       0              0      0                0
## Air Heads            0      1       0              0      0                0
## Almond Joy           1      0       0              1      0                0
##              hard bar pluribus sugarpercent pricepercent winpercent
## 100 Grand       0   1        0        0.732        0.860   66.97173
## 3 Musketeers    0   1        0        0.604        0.511   67.60294
## One dime        0   0        0        0.011        0.116   32.26109
## One quarter     0   0        0        0.011        0.511   46.11650
## Air Heads       0   0        0        0.906        0.511   52.34146
## Almond Joy      0   1        0        0.465        0.767   50.34755
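
A table like the one above could be produced with a loader along these lines. This is a minimal sketch, not necessarily how the slide was built: `load_candy` is a hypothetical helper, and it assumes the raw file stores the candy names in a first column called `competitorname`. To stay runnable offline, it reads a small stand-in file holding three rows copied from the preview above rather than the full data.

```r
# Hypothetical helper: read the candy CSV and use the candy names as row names.
# Assumes the name column is called `competitorname` (an assumption about the raw file).
load_candy <- function(path) {
  candy <- read.csv(path, stringsAsFactors = FALSE, check.names = FALSE)
  rownames(candy) <- candy$competitorname
  candy$competitorname <- NULL
  candy
}

# Stand-in file: three rows copied from the preview above, so the sketch runs offline.
sample.csv <- tempfile(fileext = ".csv")
writeLines(c(
  paste0("competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,",
         "crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent"),
  "100 Grand,1,0,1,0,0,1,0,1,0,0.732,0.860,66.97173",
  "3 Musketeers,1,0,0,0,1,0,0,1,0,0.604,0.511,67.60294",
  "Air Heads,0,1,0,0,0,0,0,0,0,0.906,0.511,52.34146"
), sample.csv)

candy.data <- load_candy(sample.csv)
head(candy.data)
```

Pointing `load_candy` at a local copy of the raw CSV from the repository listed under Sources should give the full 85-row table.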

Analysis

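Walt's linear regression found that 9 variables accounted for about half of the variation in win percentage. My basic regression here, which uses all 11 predictors in the data, gives similar though not identical results: a multiple R-squared of 0.54, with chocolate, fruity, and peanutyalmondy significant at the 5% level.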

## 
## Call:
## lm(formula = winpercent ~ ., data = candy.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2244  -6.6247   0.1986   6.8420  23.8680 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       34.5340     4.3199   7.994 1.44e-11 ***
## chocolate         19.7481     3.8987   5.065 2.96e-06 ***
## fruity             9.4223     3.7630   2.504  0.01452 *  
## caramel            2.2245     3.6574   0.608  0.54493    
## peanutyalmondy    10.0707     3.6158   2.785  0.00681 ** 
## nougat             0.8043     5.7164   0.141  0.88849    
## crispedricewafer   8.9190     5.2679   1.693  0.09470 .  
## hard              -6.1653     3.4551  -1.784  0.07852 .  
## bar                0.4415     5.0611   0.087  0.93072    
## pluribus          -0.8545     3.0401  -0.281  0.77945    
## sugarpercent       9.0868     4.6595   1.950  0.05500 .  
## pricepercent      -5.9284     5.5132  -1.075  0.28578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.7 on 73 degrees of freedom
## Multiple R-squared:  0.5402, Adjusted R-squared:  0.4709 
## F-statistic: 7.797 on 11 and 73 DF,  p-value: 9.504e-09
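
To make the coefficient table concrete, here is a small worked example of how a fitted value is assembled from it. It is only a sketch: the estimates are copied by hand from the output above, and the attribute values for 100 Grand come from the data preview. The names `coefs`, `hundred.grand`, and `fitted.win` are illustrative.

```r
# Coefficient estimates copied from the regression output above.
coefs <- c(
  "(Intercept)" = 34.5340, chocolate = 19.7481, fruity = 9.4223,
  caramel = 2.2245, peanutyalmondy = 10.0707, nougat = 0.8043,
  crispedricewafer = 8.9190, hard = -6.1653, bar = 0.4415,
  pluribus = -0.8545, sugarpercent = 9.0868, pricepercent = -5.9284
)

# Attribute values for 100 Grand, copied from the data preview.
hundred.grand <- c(
  chocolate = 1, fruity = 0, caramel = 1, peanutyalmondy = 0, nougat = 0,
  crispedricewafer = 1, hard = 0, bar = 1, pluribus = 0,
  sugarpercent = 0.732, pricepercent = 0.860
)

# Fitted value = intercept + sum of (coefficient * attribute value).
fitted.win <- coefs[["(Intercept)"]] +
  sum(coefs[names(hundred.grand)] * hundred.grand)
observed.win <- 66.97173

round(c(fitted = fitted.win, observed = observed.win), 2)
```

The fitted value comes to roughly 67.4, close to the observed 66.97 for this candy. Other candies will sit further from their fitted values, as the residual standard error of 10.7 suggests.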

Data for Plotting

This plot contains two win percentage variables: the average win percentage of candies with a given attribute, and the regressed win percentage for that attribute alone. For example, many candies that scored highly have chocolate, but they may also have caramel or nuts. The blue column comes directly from the data: it is the mean win percentage of all chocolate candies, regardless of what other attributes they have. The orange column is the regression coefficient for each variable added to the intercept, which gives the expected win percentage for a candy that has only that one attribute and no others. The intercept, around 34%, is the expected win percentage of a candy with no attributes and with sugar and price at zero (roughly, one of the coins), and each attribute either adds to or subtracts from that base percentage.
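
The two plotted quantities could be computed roughly as follows. This is a sketch under stated assumptions, not the code behind the slide: `attribute_summary` is a hypothetical helper, and the demo runs on a four-row toy data frame (not the candy data) in which the win percentage is exactly 30 + 20*a + 10*b, so the results are easy to check by hand.

```r
# Hypothetical helper: for each binary attribute, compute
#   mean_win      = average winpercent of the rows that have the attribute
#   regressed_win = intercept + that attribute's regression coefficient
attribute_summary <- function(data, attributes) {
  fit <- lm(winpercent ~ ., data = data)
  b <- coef(fit)
  data.frame(
    attribute     = attributes,
    mean_win      = unname(sapply(attributes, function(a) {
      mean(data$winpercent[data[[a]] == 1])
    })),
    regressed_win = unname(b["(Intercept)"] + b[attributes]),
    row.names     = NULL
  )
}

# Toy data for illustration only: winpercent = 30 + 20*a + 10*b.
toy <- data.frame(
  a = c(0, 1, 0, 1),
  b = c(0, 0, 1, 1),
  winpercent = c(30, 50, 40, 60)
)

toy.summary <- attribute_summary(toy, c("a", "b"))
toy.summary
```

In the toy data the mean for `a` is 55 while its regressed value is 50. The gap arises because half of the rows with `a` also have `b`, which is the same effect that separates the blue and orange columns in the plot.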

Slide with Plot
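
A grouped bar chart of this kind could be built with Plotly roughly as below. This is a sketch, not the original plotting code: the data frame `plot.data`, its column names, and its values are placeholders for illustration and are not results from the candy data.

```r
library(plotly)

# Placeholder values for illustration only; not results from the candy data.
plot.data <- data.frame(
  attribute     = c("attribute A", "attribute B"),
  mean_win      = c(55, 50),
  regressed_win = c(50, 40)
)

# One bar trace per variable, grouped side by side for each attribute.
candy.plot <- plot_ly(plot.data, x = ~attribute, y = ~mean_win,
                      type = "bar", name = "Average win %") %>%
  add_trace(y = ~regressed_win, name = "Regressed win %") %>%
  layout(barmode = "group",
         xaxis = list(title = "Attribute"),
         yaxis = list(title = "Win percentage"))

candy.plot
```

With Plotly's default color sequence, the first trace is drawn in blue and the second in orange, which matches the column colors described on the previous slide.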

Summary

Chocolate, fruity, crispy, and nutty attributes have the greatest positive impact on win percentage. The clear conclusion here is the need for a master franken-candy containing all of the above characteristics. An interesting note is that 'chocolate' and 'fruity', despite both contributing to high rankings, are nearly mutually exclusive attributes. The sole exception is Tootsie Pops, which ranked #42 with a win percentage of 49%. The chocolate + nuts combo proved powerful, with Reese's Peanut Butter Cups and spinoff candies dominating the top of the rankings.
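
The near-exclusivity of 'chocolate' and 'fruity' can be checked with a simple cross-tabulation. The sketch below uses only the six rows shown in the data preview, so it finds no candy with both attributes. Run on the full data set, the same code should single out the one exception mentioned above. The names `preview`, `overlap`, and `both` are illustrative.

```r
# Six rows copied from the data preview; the full data set has 85 rows.
preview <- data.frame(
  chocolate = c(1, 1, 0, 0, 0, 1),
  fruity    = c(0, 0, 0, 0, 1, 0),
  row.names = c("100 Grand", "3 Musketeers", "One dime",
                "One quarter", "Air Heads", "Almond Joy")
)

# Cross-tabulate the two attributes; the ("1", "1") cell counts candies with both.
overlap <- table(chocolate = preview$chocolate, fruity = preview$fruity)
overlap

# Names of candies flagged as both chocolate and fruity (none in this preview).
both <- rownames(preview)[preview$chocolate == 1 & preview$fruity == 1]
both
```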

Sources

Raw Data: https://github.com/fivethirtyeight/data/tree/master/candy-power-ranking

Original article: Walt Hickey, Oct. 17, 2017. FiveThirtyEight.com https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/